This document covers the cleaning of the data scraped from Yelp.

The Data

The general layout of the data is as follows:

## libraries used throughout this document
library(dplyr)
library(tidyr)
library(stringr)
library(chron)       ## times()
library(fastDummies) ## dummy_cols()
library(forcats)     ## fct_recode()
library(ggplot2)

yelp <- read.csv('../data/yelpData.csv', stringsAsFactors = FALSE)
colnames(yelp) <- tolower(colnames(yelp))
## drop breweries that couldn't be scraped
yelp <- yelp[!is.na(yelp$ratings),]
yelp
yelp %>% 
  group_by(name) %>% 
  summarize(n()) %>% 
  nrow()
## [1] 2465

Of the nearly 8,000 breweries on Open Brewery DB, I was able to successfully scrape 2,465 of them on the first pass.

Important Variables

Off the bat, I was a little worried because rating appears null across the board, but I see that it’s redundant with ratings, which is present. Some notes on specific vars follow.

I’ll dig into formatting the hours first.

Hours of Operation

One straightforward task to start with is converting the hours for each day into new vars holding the total number of hours open that day. Later on I’ll want to consider how to capture how early / late those hours are.

print(typeof(yelp$mon))
## [1] "character"
print(class(yelp$mon))
## [1] "character"
s1 <- yelp$mon[1]
s2 <- yelp$tue[1]
print(c(s1,s2))
## [1] "12:00 pm - 10:00 pm\n        \n                Open now"
## [2] "12:00 pm - 10:00 pm"

Alright, so for all of these I’ll want to keep only the data before the first \n, then split the range into two vars for the opening and closing times.

Then convert to time, compute the total hours that a brewery is open, reshape it all so that all hours-of-operation information for each brewery is on one row, and join it back with the main data.
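Before running the full pipeline, a quick sanity check of that regex idea against the sample string from above (a sketch; str_extract_all() from stringr pulls every time-shaped substring):

```r
library(stringr)

## same pattern used below: digits, colon, two digits, whitespace, am/pm
simpleTime <- '\\d+:\\d\\d\\s\\w\\w'
s1 <- '12:00 pm - 10:00 pm\n        \n                Open now'

## the "Open now" residue after the first \n contains nothing time-shaped,
## so only the open and close times survive
str_extract_all(s1, simpleTime)[[1]]
## [1] "12:00 pm" "10:00 pm"
```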

## Regex to extract time
simpleTime <- '\\d+:\\d\\d\\s\\w\\w'

## keep one row per brewery
test <- yelp[match(unique(yelp$name), yelp$name),]

## Regex and time conversion to get time data to behave like time data

test <- test %>% 
  select(name, mon, tue, wed, thu, fri, sat, sun) %>% 
  gather(day, times, mon:sun) %>% 
  ## Separate into open and close times based on the '-'
  separate(times, into=c('open', 'close'), sep='-') %>% 
  ## match only strings matching time format
  mutate(open = str_extract(open, simpleTime),
         close = str_extract(close, simpleTime))
## Warning: Expected 2 pieces. Additional pieces discarded in 31 rows [848,
## 1644, 2133, 2661, 3313, 4109, 4598, 5126, 5156, 5778, 6574, 7063, 7591,
## 7621, 8243, 9039, 9528, 10056, 10708, 11238, ...].
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 3075 rows
## [4, 10, 15, 17, 18, 19, 21, 24, 26, 27, 28, 34, 36, 38, 41, 42, 44, 52, 63,
## 69, ...].
## Convert to time
test$open <- strptime(test$open, format = '%I:%M %p')
test$close <- strptime(test$close, format = '%I:%M %p')
## Strip out the date, keep only the time
test$open <- times(sub('.*\\s+', '', test$open))
test$close <- times(sub('.*\\s+', '', test$close))

## Doing some serious data reshaping to get time information on one line per brewery

## Goal: spread open / close times and total hours into wide vars of the
## form day_timePeriod_measure (e.g., mon_open_time, mon_open_totalHours)
yelp <- test %>% 
  ## Compute total hours open
  mutate(totalHours = abs(close - open)) %>% 
  ## Gather open / close times
  gather(timePeriod, time, open:close) %>% 
  ## Gather total hours and open / close times
  gather(TotalOrTime, value, totalHours, time) %>% 
  ## Combine day, open / close times, total time
  unite(variable, day, timePeriod, TotalOrTime) %>% 
  ## Spread across the united variable
  spread(variable, value) %>% 
  inner_join(yelp)
## Joining, by = "name"
## open and close for total hours are the same thing
sum(yelp$thu_open_totalHours == yelp$thu_close_totalHours, na.rm = TRUE) == nrow(yelp[!is.na(yelp$thu_open_totalHours),])
## [1] TRUE
yelp <- yelp %>% 
  select(-contains('_close_totalhours'))

yelp

Whew, that was tough.
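As a sanity check on the time arithmetic above (assuming chron’s times() is what’s behind the times() calls), the sample Monday string should come out to 10 hours:

```r
library(chron)

## worked check: "12:00 pm - 10:00 pm" should yield 10 hours
open  <- times(sub('.*\\s+', '', strptime('12:00 pm', format = '%I:%M %p')))
close <- times(sub('.*\\s+', '', strptime('10:00 pm', format = '%I:%M %p')))

## times are stored as fractions of a day, so multiply by 24 for hours
abs(as.numeric(close - open)) * 24
## [1] 10
```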

Price Range

I’m going to put together a function for sorting the various levels of price_range into bins to be used as an ordinal factor.

print(summary(factor(yelp$price_range)))
##                        $11-30         $31-60        A$16-35    Inexpensive 
##            722          13837            159            100           9189 
##       Moderate         Pricey       SGD16-30 Ultra High-End      Under $10 
##           8075             65              5             14            765

I think the general idea will be to make three levels: ['cheap', 'moderate', 'expensive'] where the cutoffs are as follows:

cleanPriceRange <- function(price_range){
  ### takes as input char vector price_range
  ### returns price range (still as a char var) sorted into three groups
  
  ## buckets are lowercase to match the tolower() applied below
  cheapBucket <- c('under $10', 'inexpensive')
  moderateBucket <- c('$11-30', 'a$16-35', 'moderate')
  expensiveBucket <- c('$31-60')
  
  price_range <- tolower(price_range)
  
  ## nested ifelse() keeps the operation vectorized
  out <- ifelse(price_range %in% cheapBucket, 'cheap',
                ifelse(price_range %in% moderateBucket, 'moderate',
                       ifelse(price_range %in% expensiveBucket, 'expensive', NA)))
  
  return(out)
}
yelp <- yelp %>% 
  mutate(price_range_messy = price_range) %>% 
  select(-price_range) %>% 
  mutate(price_range = cleanPriceRange(price_range_messy))

print(summary(factor(yelp$price_range)))
##     cheap expensive  moderate      NA's 
##      9954       159     22012       806
yelp[500:550,]
yelp <- yelp %>% 
  select(-price_range_messy)

Alright, that feature should be mostly good to go. Might want to dummy code it later, but for now it’s fine.
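As an aside on that nested ifelse() chain: dplyr’s case_when() is also vectorized and reads top-to-bottom, so the same binning can be written flat. A sketch (cleanPriceRange2 is just an illustrative name):

```r
library(dplyr)

## same binning as cleanPriceRange(), written with case_when():
## the first matching condition wins, and the final TRUE is the catch-all
cleanPriceRange2 <- function(price_range) {
  pr <- tolower(price_range)
  case_when(
    pr %in% c('under $10', 'inexpensive')      ~ 'cheap',
    pr %in% c('$11-30', 'a$16-35', 'moderate') ~ 'moderate',
    pr == '$31-60'                             ~ 'expensive',
    TRUE                                       ~ NA_character_
  )
}

cleanPriceRange2(c('Under $10', 'Moderate', '$31-60', 'Pricey'))
## [1] "cheap"     "moderate"  "expensive" NA
```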

More Info

Time to deal with the nuanced info vars. The mission will be to try and convert them to wide, but we’ll see how it goes.

I’ll need to get a sense of the degree of variability in the moreinfovar variable.

summary(factor(yelp$moreinfovar))
##           Accepted Cards        Accepts Apple Pay     Accepts Credit Cards 
##                       12                     1950                     2500 
##   Accepts Cryptocurrency       Accepts Google Pay                  Alcohol 
##                      699                     1137                      877 
##                 Ambience                   Attire              Best Nights 
##                      724                      578                      328 
##             Bike Parking      By Appointment Only                   Caters 
##                     2335                        7                     2201 
##               Coat Check                 Delivery             Dogs Allowed 
##                      307                      633                     1113 
##               Drive-Thru Gender Neutral Restrooms                 Good For 
##                       44                      696                      532 
##         Good For Dancing          Good for Groups      Good For Happy Hour 
##                      383                      888                     1008 
##            Good for Kids         Good for Working      Happy Hour Specials 
##                     1080                       11                       42 
##   Has Dairy-free Options  Has Gluten-free Options        Has Halal Options 
##                       78                       93                        7 
##         Has Keto Options       Has Kosher Options           Has Pool Table 
##                       11                       10                      357 
##     Has Soy-free Options                   Has TV                 Legal ID 
##                       18                      882                        1 
##          Liked by Vegans     Liked by Vegetarians                    Music 
##                      157                      333                       61 
##              Noise Level Offers Military Discount              Open to All 
##                     1217                      182                      154 
##          Outdoor Seating                  Parking                  Smoking 
##                      904                     2117                      297 
##                 Take-out       Takes Reservations           Waiter Service 
##                     2472                      850                      407 
##    Wheelchair Accessible                    Wi-Fi 
##                     1352                      886

Hmm, yeah, this is tough. Some of these (e.g., Has Soy-free Options and Has Keto Options) appear only a handful of times. Converting these to wide will result in NAs pretty much across the board. I do want to avoid pruning them, as there is much more data yet to be collected. I might just convert to wide, then apply an NA-quality threshold before moving on to analysis. The basic idea: drop vars with too many NAs, and tackle imputing values for those that seem worth saving.

Here we go:

## Some breweries have redundant entries for moreinfovar (80 total), need to drop these for spread to work 
badBrews <- yelp %>% 
  group_by(name) %>% 
  summarize(dups = sum(duplicated(moreinfovar))) %>% 
  filter(dups >= 1) %>% 
  select(name)

yelp <- yelp[!(yelp$name %in% badBrews$name),]

yelp <- yelp %>% 
  spread(moreinfovar, moreinfoval) 

colnames(yelp) <- tolower(colnames(yelp))

yelp

This seems surprisingly alright. Luckily, most of these features seem to be coded as Yes / No; I’ll have to think about how to handle the ones that aren’t (e.g., Parking).

The following function computes the proportion of missing data (from most to least) for all variables:

Realizing I could’ve done this with sapply(), but this formats it nicely.

summarizeNAs <- function(data) {
  ## Returns a sorted df where column one is the name of the original column and column two is the proportion of data missing from that column
  
  holder <- data.frame(colName = character(), propMissing = numeric(), stringsAsFactors = FALSE)
  count <- 0
  for (colIndex in 1:(ncol(data))) {
    count <- count + 1
    holder[count, 1] <- colnames(data)[colIndex]
    holder[count, 2] <- sum(is.na(data[,colIndex])) / nrow(data)
  }
  return(holder[order(holder$propMissing, decreasing = TRUE),])
}

naSummary <- summarizeNAs(yelp)
summarizeNAs(yelp)
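For what it’s worth, the same table falls out of one line with colMeans(), since is.na() on a data frame returns a logical matrix. A sketch on a toy frame (df is illustrative, not the real data):

```r
## toy frame: column a is 2/3 missing, b is 1/3, c is complete
df <- data.frame(a = c(1, NA, NA), b = c(1, 2, NA), c = 1:3)

## is.na(df) is a logical matrix; colMeans() of it gives the NA proportion
sort(colMeans(is.na(df)), decreasing = TRUE)
##         a         b         c 
## 0.6666667 0.3333333 0.0000000
```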

Unfortunately, as of now, only a few (5) of these special info variables have < 20% of their values missing. For now, I’m going to drop all vars with > 50% missing values.

yelp <- yelp[, !(colnames(yelp) %in% naSummary[naSummary$propMissing > .5, 1])]

This leaves the following vars to work with:

colnames(yelp)
##  [1] "name"                  "fri_close_time"       
##  [3] "fri_open_time"         "fri_open_totalhours"  
##  [5] "mon_close_time"        "mon_open_time"        
##  [7] "mon_open_totalhours"   "sat_close_time"       
##  [9] "sat_open_time"         "sat_open_totalhours"  
## [11] "sun_close_time"        "sun_open_time"        
## [13] "sun_open_totalhours"   "thu_close_time"       
## [15] "thu_open_time"         "thu_open_totalhours"  
## [17] "tue_close_time"        "tue_open_time"        
## [19] "tue_open_totalhours"   "wed_close_time"       
## [21] "wed_open_time"         "wed_open_totalhours"  
## [23] "x1_star_count"         "x2_star_count"        
## [25] "x3_star_count"         "x4_star_count"        
## [27] "x5_star_count"         "fri"                  
## [29] "mon"                   "sat"                  
## [31] "sun"                   "thu"                  
## [33] "tue"                   "wed"                  
## [35] "category"              "city"                 
## [37] "claimed_status"        "health_rating"        
## [39] "ratings"               "reviews"              
## [41] "street"                "price_range"          
## [43] "accepts apple pay"     "accepts credit cards" 
## [45] "bike parking"          "caters"               
## [47] "parking"               "take-out"             
## [49] "wheelchair accessible"

Writing a quick mode-imputation function:

getmode <- function(v) {
   v <- na.omit(v)
   uniqv <- unique(v)
   ## tabulate takes the integer-valued vector bin and counts the number of times each integer occurs in it.
   uniqv[which.max(tabulate(match(v, uniqv)))]
}


modeImputation <- function(data){
  ## replace the NAs in each column with that column's mode
  for (column in colnames(data)) {
    modeValue <- getmode(data[,column])
    data[is.na(data[,column]), column] <- modeValue
  }
  return(data)
}

yelp <- modeImputation(yelp)
summarizeNAs(yelp)

Very good.

Quickly dealing with the parking variable:

yelp <- yelp %>% 
  ## If there's more than one parking option just label it 'multiple'
  mutate(parking = ifelse(grepl(',', parking), 'multiple', parking))

Let’s convert some of these to dummy variables:

colnames(yelp) <- gsub('-', '_', colnames(yelp))
colnames(yelp) <- tolower(gsub(' ', '_', colnames(yelp)))

dummyCols <- yelp %>% 
  ## I'm gonna leave the category var alone for now
  select(price_range:wheelchair_accessible)

yelp <- dummy_cols(yelp, select_columns = colnames(dummyCols), remove_first_dummy = TRUE)

colnames(yelp) <- tolower(colnames(yelp))

## Converting so dummy vars reflect yes value rather than no
yelp <- yelp %>% 
  mutate(accepts_credit_cards_yes = ifelse(accepts_credit_cards_no == 0, 1, 0),
         bike_parking_yes = ifelse(bike_parking_no == 1, 0, 1),
         wheelchair_accessible_yes = ifelse(wheelchair_accessible_no == 1, 0, 1),
         accepts_apple_pay_yes = ifelse(accepts_apple_pay_no == 1, 0, 1)) %>% 
  select(-bike_parking_no, -accepts_credit_cards_no, -claimed_status, -accepts_apple_pay_no, -wheelchair_accessible_no, -(fri:wed))

Deal with Reviews

yelp <- yelp %>% 
  mutate(reviews = as.numeric(str_extract(reviews, '\\d+')))

## Resorting to a for loop to calculate weighted ratings (a vectorized
## mutate() fails here because sum() collapses across all rows)
store <- numeric()
for (row in 1:(nrow(yelp))) {
  store[row] <- with(yelp[row,], (x1_star_count * 1 + x2_star_count * 2 + x3_star_count * 3 + x4_star_count * 4 + x5_star_count * 5) / sum(c(x1_star_count,x2_star_count,x3_star_count,x4_star_count,x5_star_count)))
}

yelp$my_ratings <- store
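For the record, the loop isn’t strictly necessary: a mutate() whose denominator uses sum() collapses the star-count columns across every row, but writing the denominator with + keeps everything row-wise. A sketch on toy counts (column names match the real ones, values are made up):

```r
## toy star counts: row 1 averages to 4 stars, row 2 to 3 stars
df <- data.frame(x1_star_count = c(0, 2), x2_star_count = c(0, 0),
                 x3_star_count = c(1, 0), x4_star_count = c(2, 0),
                 x5_star_count = c(1, 2))

## row-wise weighted mean: '+' recycles per row, unlike sum()
df$my_ratings <- with(df, (x1_star_count*1 + x2_star_count*2 + x3_star_count*3 +
                           x4_star_count*4 + x5_star_count*5) /
                          (x1_star_count + x2_star_count + x3_star_count +
                           x4_star_count + x5_star_count))
df$my_ratings
## [1] 4 3
```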

yelp %>% 
  gather(rating_type, rating, ratings, my_ratings) %>% 
  mutate(rating_type = fct_recode(rating_type, 'Yelp Ratings' = 'ratings', 'My Converted Ratings' = 'my_ratings')) %>% 
  ggplot(aes(x = rating)) + geom_density(fill = 'blue', alpha = .8) + xlab('Ratings') + facet_wrap(~rating_type) + 
  theme_bw() +
  theme(strip.background = element_rect(fill = 'white', color = 'black')) 

Calculating the weighted scores by hand allows for more variability.

I think that pretty much wraps it up.

yelp

Merging with some of the original data

breweries <- read.csv('../../../data/breweries.csv')
census <- read.csv('../../../data/censusData.csv')
colnames(census) <- tolower(colnames(census))
census <- census[!(census$name %in% c('United States', 'Puerto Rico Commonwealth', 'District of Columbia')),]
census <- census %>% 
  select(name, popestimate2017) %>% 
  rename(state = name, population = popestimate2017)

yelp <- breweries %>% 
  select(name, state) %>% 
  inner_join(yelp, by = 'name')
## Warning: Column `name` joining factor and character vector, coercing into
## character vector
breweries <- breweries %>% 
  inner_join(census) %>% 
  select(name, state, population) %>% 
  group_by(state) %>% 
  summarize(population = max(population), nBreweries = n()) %>% 
  mutate(breweriesToPpl = nBreweries / population)
## Joining, by = "state"
## Warning: Column `state` joining factors with different levels, coercing to
## character vector
yelp <- yelp %>% 
  inner_join(breweries, by = 'state')
## Warning: Column `state` joining factor and character vector, coercing into
## character vector
write.csv(yelp, '../data/cleanYelp.csv', row.names = FALSE)